In this paper, dimensionality reduction in large datasets is achieved using 
the proposed distance based Non-negative Matrix Factorization (NMF) 
technique, which is intended to solve the data dimensionality problem. Here, 
NMF and distance measurement aim to resolve the non-orthogonality 
problem due to increased dataset dimensionality. It initially partitions the 
datasets, organizes them into a defined geometric structure and it avoids 
capturing the dataset structure through a distance based similarity 
measurement. The proposed method is designed to fit dynamic datasets 
and includes the intrinsic structure using data geometry. Therefore, the 
complexity of data is further avoided using an Improved Distance based 
Locality Preserving Projection. The proposed method is evaluated against 
existing methods in terms of accuracy, average accuracy, mutual information 
and average mutual information. 
1. INTRODUCTION 

In recent years, large dimensional datasets have been generated in the presence of uncertainty, and 
they have been increasingly used in several applications like environmental monitoring, sensor networks, 
data cleaning, moving object management and data integration. The presence of uncertainty in large 
dimensional datasets is due to imprecise measurement, unreliable data transfer, privacy protection, repeated 
sampling and so on [1]. These applications create a demand for effective management of large dimensional 
datasets and their processing, which is the major issue in large database systems [2]. 

Data reduction [3] in large datasets reduces the data dimensionality while retaining the data 
representation. Data reduction is also used to reduce the number of instances in a given dataset. In spite of 
many efforts to deal with such instances, data mining algorithms have undergone severe challenges due to 
their non-applicability to datasets with large numbers of instances. The computational complexity of the 
system increases with the number of instances and leads to problems in scaling, increased storage 
requirements and reduced clustering accuracy [4]. The other problems associated with larger data instances 
include: improper association or interaction in the feature space, lack of ability to handle large datasets with 
discrete variables, inability to classify the data, poor knowledge generation for a given query, and finally poor 
computation due to missing variables, low dimensional features or feature selection [5] in high dimensional 
datasets [6,7]. 

There are several dimensionality reduction techniques [8-27] dealing with high-dimensional 
data [26]. The common strategy in the literature is a reduction of dimensionality based on 
the variations in the class labels [27]. Hence, to boost the performance of learning in classification systems 
and to address the above problems, an effective unsupervised model is needed for reducing large 
dimensional datasets. This is usually carried out through the reduction or elimination of unwanted features 
from the datasets [28-30]. Feature reduction in clinical datasets is discussed in [31], and [32] explains 
dimensionality reduction with kernel PCA. 

In this paper, we propose an Improved Distance based Locality Preserving Projections (IDLPP) 
technique for reducing datasets of high dimensionality. The proposed system is inspired by the idea of 
Locality Preserving Projections (LPP). NMF is used for eliminating the low dimensional features. 
The distance estimation is computed using a probabilistic distance measurement, which represents the 
estimation of probability between two different data samples. 

The main contributions involve the following: 

1. The proposed solution finds the similarity between the data samples using squared distance 
representation. 

2. The low dimensional features are eliminated using NMF. 

3. Finally, the proposed IDLPP technique is compared against other LPP methods using accuracy, average 
accuracy, NMI and average NMI. 

The outline of the paper is as follows: Section 2 outlines LPP with the data partitioning 
technique and the NMF metric estimation. Section 3 discusses the similarity measurement based on the 
distance between the nodes. Section 4 evaluates IDLPP against other LPP methods experimentally and 
discusses the results. Finally, Section 5 concludes the paper. 


2. LOCALITY PRESERVING PROJECTIONS 

LPP is an unsupervised method used as a dimensionality reduction technique in data mining 
with larger datasets. This method handles the structure of such datasets in a better manner than principal 
component analysis. Further, the local dataset structure is preserved through the construction of adjacency 
graphs using the k nearest neighbor algorithm. 


2.1. Data partitioning 

Consider two data samples $x_i$ and $x_j$ in a large dimensional dataset, which lie in close proximity. 
The distance between these two samples is found through the k-nearest neighbor algorithm. This forms an 
edge between the data samples, and the weight of the edge between the two samples is computed as, 


$$S_{ij} = \lVert x_i - x_j \rVert \qquad (1)$$
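
As an illustration of the graph construction described above, the following minimal Python sketch builds a symmetric k-nearest-neighbor weight matrix. The function name `knn_graph`, the width parameter `t` and the heat-kernel weight $\exp(-\lVert x_i - x_j \rVert^2 / t)$ are assumptions of this sketch (a common choice in the LPP literature) and may differ from the weight used in (1).

```python
import math


def knn_graph(samples, k, t=1.0):
    """Return a symmetric weight matrix S for a k-nearest-neighbour graph.

    samples: list of equal-length lists of floats (one row per sample).
    k: number of neighbours considered for each sample.
    t: heat-kernel width (assumed here; not taken from the paper).
    """
    n = len(samples)

    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Rank all other samples by squared distance to sample i.
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: sq_dist(samples[i], samples[j]))
        for j in order[:k]:
            w = math.exp(-sq_dist(samples[i], samples[j]) / t)
            # An edge exists if either sample is a neighbour of the other.
            S[i][j] = w
            S[j][i] = w
    return S


points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
S = knn_graph(points, k=1)
```

With `k=1`, the four points form two separate edges, so only the entries linking each close pair are non-zero.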


Assume that the sample set at the parent node is represented as a matrix 
$X = [x_1, x_2, \ldots, x_n]^T \in \mathbb{R}^{n \times d}$, where n is the total number of examples in the 
sample set. The sample set is divided into two subsets, i.e. the left and right child nodes, based on a decision, 
which are represented as $X_1 \in \mathbb{R}^{n_1 \times d}$ and $X_2 \in \mathbb{R}^{n_2 \times d}$. Each 
instance has its own attributes, which are weighted through a combination weight vector, say, w. This gives the 
projection P(x) of a sample point x in matrix X along the orientation of the weight vector w: 
$P(x) = w \cdot x$. After defining the split value p, the matrix X is divided into two subsets $X_1$ and $X_2$ 
based on the projection values $\{P(x), x \in X\}$: 

$$x \in X_1 \quad \text{if} \quad P(x) = w \cdot x > p$$
$$x \in X_2 \quad \text{if} \quad P(x) = w \cdot x < p$$

and p is taken as the median (m) of all projections of matrix X: 
$p = m = \mathrm{median}\{P(x_i), x_i \in X, i = 1, 2, \ldots, n\}$. 
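
The median-based partitioning can be sketched as follows. The function name `median_split` and the handling of ties (samples whose projection equals p are placed in the second subset) are assumptions made for illustration only.

```python
from statistics import median


def median_split(X, w):
    """Split the rows of X into two child nodes by their projection onto w."""
    # P(x) = w . x for every sample x in X.
    proj = [sum(wi * xi for wi, xi in zip(w, x)) for x in X]
    p = median(proj)  # split value p = median of all projections
    X1 = [x for x, v in zip(X, proj) if v > p]
    # Ties (P(x) == p) go to X2 here; this choice is an assumption.
    X2 = [x for x, v in zip(X, proj) if v <= p]
    return X1, X2, p


X = [[1.0, 2.0], [3.0, 1.0], [0.0, 0.0], [4.0, 4.0]]
w = [1.0, 1.0]
X1, X2, p = median_split(X, w)
```

Because the split value is the median, the two child nodes receive roughly equal numbers of samples, which keeps the partition tree balanced.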
The similarity matrix is obtained as $S = \{s_{ij}\}_{i,j=1}^{N}$, which gives the similarity estimation 
between the N data samples. 
Assume two different samples $x_i$ and $x_j$ lie in a subspace in close proximity; then the new data 
samples $y_i$ and $y_j$ will lie close to each other in the new subspace. Therefore, the estimation of the 
projection vector (a) is carried out using the following equation, 

$$0.5 \sum_{ij} (y_i - y_j)^2 s_{ij} = 0.5 \sum_{ij} (x_i a^T - x_j a^T)^2 s_{ij}, \quad i, j = 1, 2, \ldots, N \qquad (2)$$

where $y_i = x_i a^T$ and X is the sample matrix. 
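Further, the diagonal matrix (D) is introduced into (2) to attain the following relation using a Laplacian 
matrix, which is given by, 

$$\begin{aligned}
0.5 \sum_{ij} (y_i - y_j)^2 s_{ij} &= \sum_i x_i a^T D_{ii}\, a x_i^T - \sum_{ij} x_i a^T s_{ij}\, a x_j^T \\
&= a^T X (D - S) X^T a \\
&= a^T X L X^T a
\end{aligned} \qquad (3)$$

where $D_{ii} = \sum_j s_{ij}$ is the diagonal matrix and $L = D - S$ is the Laplacian matrix. In the last two 
lines, X is arranged with one sample per column and a is written as a column vector, so that $a^T x_i$ is the 
same scalar as $y_i = x_i a^T$ in (2). 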
The diagonal matrix is further used as a constraint to obtain the objective function of LPP, which is given by 
the following condition. 

$$\arg\min_{a} \; a^T X L X^T a \quad \text{s.t.} \quad a^T X D X^T a = 1 \qquad (4)$$


The optimal projection vector (a) is found by solving the generalized eigenvalue problem. The following 
equation gives the optimal projection vector (a). 

$$X L X^T a = \lambda X D X^T a \qquad (5)$$

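
A minimal numerical sketch of solving the generalized eigenvalue problem in (5) is given below, assuming NumPy is available. The function name `lpp_directions`, the small ridge term added to $X D X^T$ and the toy data are assumptions of this sketch and are not part of the proposed method; X is arranged with one sample per column, as the form of (5) implies.

```python
import numpy as np


def lpp_directions(X, S, n_components=1):
    """Solve X L X^T a = lambda X D X^T a for the smallest eigenvalues.

    X: (d, n) array with one sample per column.
    S: (n, n) symmetric similarity matrix.
    """
    D = np.diag(S.sum(axis=1))   # D_ii = sum_j s_ij
    L = D - S                    # graph Laplacian
    A = X @ L @ X.T
    B = X @ D @ X.T
    # A small ridge keeps B positive definite (assumed, for stability).
    B = B + 1e-9 * np.eye(B.shape[0])
    # Reduce to a standard symmetric problem via Cholesky: B = C C^T.
    C = np.linalg.cholesky(B)
    C_inv = np.linalg.inv(C)
    vals, vecs = np.linalg.eigh(C_inv @ A @ C_inv.T)  # ascending order
    # Map back with a = C^{-T} v, which gives a^T B a = 1 as in (4).
    a = C_inv.T @ vecs[:, :n_components]
    return vals[:n_components], a


X = np.array([[0.0, 0.1, 5.0, 5.1],
              [0.0, 1.0, 0.2, 1.1]])
S = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
vals, a = lpp_directions(X, S)
```

The eigenvectors belonging to the smallest eigenvalues are kept because (4) is a minimization; the projected samples are then obtained as `a.T @ X`.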
2.2. NMF Metric Estimation 

Assume the optimal projection vector (a) is approximated and applied over the feature space (G), 
which represents the feature vectors (F) of the sample data (X). Each feature vector is normalized to 
$\lVert f \rVert = 1$ and then the Gram matrix ($F^T F$) is found for the normalized feature vectors using a 
metric (M). 

$$M = F^T F, \quad \text{s.t.} \; u_l^T u_l = 1, \; \forall l = 1, \ldots, c \qquad (6)$$

The label information is avoided by using the metric (M) that estimates the Gram matrix, and the 
approximation of the sample data vector over the feature space is used to obtain the metric in the feature 
space, i.e. $M = F^T F$. 
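
The metric in (6) can be illustrated with a short sketch. Treating the feature vectors as the columns of F, and the function name `gram_metric`, are assumptions of this example; it also assumes that no feature vector has zero norm.

```python
import math


def gram_metric(features):
    """Return M = F^T F for feature vectors normalised to unit length.

    features: list of feature vectors (lists of floats), taken here as
    the columns of F (an assumption of this sketch).
    """
    unit = []
    for f in features:
        norm = math.sqrt(sum(v * v for v in f))
        unit.append([v / norm for v in f])  # enforce ||f|| = 1
    # Entry (i, j) is the inner product of unit vectors i and j.
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in unit]
            for fi in unit]


M = gram_metric([[3.0, 4.0], [4.0, 3.0], [0.0, 2.0]])
```

After normalization the diagonal of M equals one, and each off-diagonal entry is the cosine of the angle between two feature vectors, so no label information is required.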


3. DLPP BASED SIMILARITY MEASUREMENT 
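The Euclidean distance $\eta$ between the vectors $X_i = (x_{i1}, x_{i2}, \ldots, x_{iD})^T$ and 
$X_j = (x_{j1}, x_{j2}, \ldots, x_{jD})^T$ is given by, 

$$\eta = \lVert X_i - X_j \rVert = \sqrt{z}, \qquad z = \sum_{d=1}^{D} (x_{id} - x_{jd})^2$$

where z is the squared distance between the vectors $X_i$ and $X_j$. 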


The distance $\eta$ is estimated between any two vectors $X_i, X_j \in X$ with one or more missing values. 
Hence, we assume that vector $X_i$ and vector $X_j$ are independent. Since the distance $\eta$ is a transform of 
vector $X_i$ and vector $X_j$, it is regarded as a random variable. This takes into account the 
missing values, which are modeled below. Consider the distance $\eta$ as a non-negative random variable, 
where the expected distance is given in terms of a Probability Density Function $p(\eta)$, 

$$E[\eta] = \int_{0}^{\infty} p(\eta)\, \eta \, d\eta$$


Improved Probabilistic Distance based Locality Preserving Projections Method... (Jasem M. Alostad) 


596 ISSN: 2088-8708 

The statistical model is used to resolve the above integral, where the squared distance is written as, 

$$z = \sum_{d=1}^{D} \varphi_d^2$$

with $\varphi_d = x_{id} - x_{jd}$. 

Assume a component, say $x_{id}$ or $x_{jd}$, is missing in the given data space; then the value of z is 
considered as a sum of squared random variables ($\varphi_d$). Following [18], the distribution of the 
summed $\varphi_d^2$ is assumed to be a Gamma distribution if and only if the PDF of the random variable 
$\varphi$ is given by, 

$$p(\varphi) = h(\varphi)\, |\varphi|^{2\alpha - 1} \exp\{-\beta \varphi^2\}$$

where $\alpha$ and $\beta$ are the distribution parameters and h satisfies the following condition for a 
constant (C), 

$$\forall \varphi: \; h(\varphi) + h(-\varphi) = C.$$

Assuming that z follows a Gamma distribution, it is reasonable to choose a Nakagami [12] distribution for the 
distance $\eta$. The random variable is considered as Nakagami distributed, i.e. 
$\eta \sim \text{Nakagami}(m, \Omega)$, which is obtained by using $\eta^2 = z \sim \text{Gamma}(\alpha, \beta)$. 
The Nakagami distribution is a function of two parameters (shape and spread) and was originally used to 
model scattered signals that reach a receiver through multiple paths. Based on the assumption 
$\eta \sim \text{Nakagami}(m, \Omega)$, the expected value of the distance, i.e. $E[\eta]$, is given as: 


$$E[\eta] = \frac{\Gamma(m + 0.5)}{\Gamma(m)} \sqrt{\frac{\Omega}{m}}$$

where m is the shape parameter of the Nakagami distribution, $\Omega$ is its spread parameter and 
$\Gamma(\cdot)$ is the Gamma function. 
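
The expected distance can be evaluated numerically with the standard library alone. In the sketch below, the function names `expected_distance` and `nakagami_from_gamma` are hypothetical, and $\beta$ is interpreted as a rate parameter of the Gamma distribution, which is an assumption; under that reading, $z \sim \text{Gamma}(\alpha, \beta)$ gives $\sqrt{z} \sim \text{Nakagami}(m = \alpha, \Omega = \alpha / \beta)$.

```python
import math


def expected_distance(m, omega):
    """E[eta] for eta ~ Nakagami(m, omega)."""
    # lgamma avoids overflow of Gamma(m) for large shape values.
    ratio = math.exp(math.lgamma(m + 0.5) - math.lgamma(m))
    return ratio * math.sqrt(omega / m)


def nakagami_from_gamma(alpha, beta):
    """Map z ~ Gamma(shape=alpha, rate=beta) to the parameters of sqrt(z).

    Treating beta as a rate is an assumption of this sketch.
    """
    m = alpha               # shape carries over unchanged
    omega = alpha / beta    # spread equals E[z] = E[eta^2]
    return m, omega


m, omega = nakagami_from_gamma(alpha=2.0, beta=0.5)
print(expected_distance(m, omega))
```

For vectors with missing components, this expectation can stand in for the unavailable exact distance when the similarity between two samples is computed.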


4. EXPERIMENTAL EVALUATION 

The proposed IDLPP method is tested against three datasets, namely the 20 Newsgroups data shown in 
Figure 1, the Reuters 21578 data shown in Figure 2 and the R52 data shown in Figure 3. Initially, the data is 
preprocessed using the trunc5 stemmer technique and the POS Tagger technique. Then the stop word removal 
technique is used to remove the stop words, and the remaining words are accepted based on mutual information. 
The dataset sample selection is given in Table 1. 


[Bar chart; legend: Class 1 to Class 5; categories: number of training documents, number of test documents, total number of documents, total attributes] 

Figure 1. Attributes of 20 newsgroups dataset 
Figure 2. Attributes of Reuters 21578 dataset 


[Bar chart; legend: Class 1 to Class 5; categories: number of training documents, number of test documents, total documents, total attributes] 

Figure 3. Attributes of R52 dataset 


Table 1. Dataset Sample Selection 

Samples      R52 dataset   20 News Group dataset   Reuters 21578 dataset 
Sample 1     7             7                       6 
Sample 2     7             6                       7 
Sample 3     6             7                       7 
Sample 4     5             8                       7 
Sample 5     7             8                       5 
Sample 6     8             7                       5 
Sample 7     5             7                       8 
Sample 8     7             5                       8 
Sample 9     10            5                       5 
Sample 10    5             5                       10 
Sample 11    5             10                      5 
Sample 12    10            10                      0 
Sample 13    10            0                       10 
Sample 14    0             10                      10 
Sample 15    5             15                      0 
Sample 16    5             0                       15 
Sample 17    0             15                      5 
Sample 18    20            0                       0 
Sample 19    0             0                       20 
Sample 20    0             20                      0 

4.1. Result discussion 

Figure 4 shows the accuracy of IDLPP and existing LPP methods over the 20 samples. Figure 5 shows the 
average accuracy of IDLPP and existing LPP methods over the three datasets, i.e. the 20 Newsgroups, 
Reuters 21578 and R52 datasets. The results show that the proposed method obtains a higher accuracy rate 
than the other methods. They also indicate that the discarding of irrelevant feature vectors from the dataset 
by the proposed method is efficient and more robust than in the existing LPP methods. Figure 6 shows the 
NMI of IDLPP and existing LPP methods over the 20 samples. Figure 7 shows the average NMI of IDLPP and 
existing LPP methods over the three datasets, i.e. the 20 Newsgroups, Reuters 21578 and R52 datasets. 
The proposed method obtains higher NMI than the other methods, which is due to the effective reduction of 
redundant data samples from the larger datasets. The use of NMF helps to reduce the feature vector and the 
use of distance based measurement reduces the distance between the dataset samples. 
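
For reference, NMI between a set of true labels and a set of predicted labels can be computed as in the following sketch. The normalization by the geometric mean of the two entropies and the function name `nmi` are assumptions of this example, since the normalization used in the experiments is not stated here.

```python
import math
from collections import Counter


def nmi(labels_true, labels_pred):
    """Normalised mutual information I(U;V) / sqrt(H(U) H(V))."""
    n = len(labels_true)
    count_t = Counter(labels_true)
    count_p = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))
    # Mutual information from the joint and marginal label counts.
    mi = sum((c / n) * math.log(c * n / (count_t[t] * count_p[p]))
             for (t, p), c in joint.items())
    h_t = -sum((c / n) * math.log(c / n) for c in count_t.values())
    h_p = -sum((c / n) * math.log(c / n) for c in count_p.values())
    if h_t == 0.0 or h_p == 0.0:
        return 0.0  # convention chosen here for single-cluster labelings
    return mi / math.sqrt(h_t * h_p)


same_partition = nmi([0, 0, 1, 1], [1, 1, 0, 0])
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

NMI is invariant to a renaming of the cluster labels, which is why the first example scores as a perfect match even though the label values are swapped.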


[Legend of compared methods: ELPP, Extended LPP, CLPP, AGLPP] 

Figure 4. Results of accuracy using IDLPP and other LPP methods 
Figure 5. Results of average accuracy using IDLPP and other LPP methods 
Figure 6. Results of NMI using IDLPP and other LPP methods 
Figure 7. Results of average NMI using IDLPP 


Further, the proposed method and other existing methods have been tested over UCI datasets. 
Figure 8 shows the classification accuracy on the UCI datasets. The total number of instances, classes and 
dimensions are listed in Table 2. The classification accuracy of the proposed and existing methods has been 
compared, and the results show that the proposed method obtains higher classification accuracy than the 
other methods. This demonstrates the efficacy of the proposed method. 


Table 2. UCI Dataset Sample 

Dataset         No. of Instances   No. of Classes   No. of Dimensions 
Anneal          898                5                90 
Breast Tissue   106                6                9 
Colic           368                2                60 
Hepatitis       155                2                19 
House           232                2                16 
Hypothyroid     368                2                60 
Promoter        106                2                57 
Sonar           208                2                60 
Wdbc            569                2                30 
Wine            178                3                13 

5. CONCLUSION 


In this paper, an IDLPP method is presented that increases the rate of accuracy and mutual 
information over large dimensional text datasets to retrieve the results effectively for the given queries. The 
distance measurement between the sample data vectors is carried out in a probabilistic way in IDLPP, 
and this reveals that there is a hidden geometric pattern. The method also reduces high dimensional irrelevant 
samples in large datasets while preserving the geometric information of the datasets, which increases the 
robustness. The results show that the IDLPP method yields an improved rate of accuracy and an 
improved rate of NMI over other LPP methods, and it is an improved method to preserve the locality 
projections. 